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5P Abstract 

-^ The paper introduces the sweeping preconditioner, which is highly efficient for it- 
erative solutions of the variable coefficient Helmholtz equation including very high fre- 
quency problems. The first central idea of this novel approach is to construct an ap- 

I— I proximate factorization of the discretized Helmholtz equation by sweeping the domain 

-^ layer by layer, starting from an absorbing layer or boundary condition. Given this spe- 

^' cific order of factorization, the second central idea of this approach is to represent the 

• intermediate matrices in the hierarchical matrix framework. In two dimensions, both 

^"pi the construction and the application of the preconditioners are of linear complexity. 

C^ The GMRES solver with the resulting preconditioner converges in an amazingly small 

^ number of iterations, which is essentially independent of the number of unknowns. This 

'""' approach is also extended to the three dimensional case with some success. Numerical 

^v^ results are provided in both two and three dimensions to demonstrate the efficiency of 

^ this new approach. 

o 

On Keyw'ords. Helmholtz equation, perfectly matched layer, absorbing boundary condi- 

^ tion, high frequency waves, preconditioner, LDL^ factorization. Green's function, matrix 

compression, hierarchical matrices. 

AMS subject classifications. 65F08, 65N22, 65N80. 



• 

o 
o 



1 Introduction 



/\^ This is the first of a series of papers on developing efficient preconditioners for the numerical 

^ solutions of the Helmholtz equation in two and three dimensions. The efficiency of precon- 

ditioners for the Helmholtz equation in the important high frequency range are at present 
much lower than that of preconditioners of typical elliptic problems. This paper develops 
efficient preconditioners of the Helmholtz equation by exploiting the physical property of 
the wave phenomena and certain low rank interaction properties of the Green's function. 

Let the domain of interest be the unit box D = (0,1)^ with d = 2,3. The time- 
independent wave field u(x) for x G D satisfies 

Au{x) + ^--u{x) = f{x), (1) 

where cu is the angular frequency and c(x) is the velocity field and f{x) is the external 
force. Commonly used boundary conditions are approximations of the Sommerfeld condi- 
tion which guarantees that the wave field generated by f{x) propagates out of the domain. 



Other boundary condition for part of the boundary will also be considered. By appro- 
priately rescaling the system, it is convenient to assume that the mean of c{x) is around 
1. Then ^ is the (average) wave number of this problem and A = ^ is the (typical) 
wavelength. 

The Helmholtz equation is ubiquitous since it is the root of almost all linear wave 
phenomena. Applications of the Helmholtz equation are abundant in acoustics, elasticity, 
electromagnetics, quantum mechanics, and geophysics. As a result, efficient and accurate 
numerical solution of the Helmholtz problem is one of the urgent problems in computational 
mathematics. This is, however, a very difficult problem due to two main reasons. Firstly, 
in a typical engineering application, the Helmholtz equation is discretized with at least 8 
to 16 points per wavelength. Therefore, the number of samples n in each dimension is 
proportional to cj, the total number of samples N is n^ = 0{uj^)^ and the discrete system 
of the Helmholtz equation is of size 0{uj^) x 0{uj^). In the high frequency range when uj is 
large, this is an enormous system. Secondly, as the discrete system is highly indefinite and 
has a very oscillatory Green's function due to the wave nature of the Helmholtz equation, 
most of the modern multiscale techniques developed for elliptic or parabolic problems are 
no longer effective. 

1.1 Approach and contribution 

In this paper, we propose a sweeping preconditioner ioi the iterative solution of the Helmholtz 
equation. In all examples, the Helmholtz equation is discretized by centered finite differ- 
ences, i.e., the 5-point stencil in 2D and the 7-point stencil in 3D. 

In the 2D case, this new preconditioner is based on a block LDL^ factorization of 
the discrete Helmholtz operator. The overall process is to eliminate the the unknowns 
layer by layer, starting from an layer with Sommerfeld condition specified. The main 
observation is that each intermediate n x n Schur complement matrix of this block LDL^ 
factorization roughly corresponds to the restriction of a half-space Green's function to 
a line and these Schur complement matrices are highly compressible with low-rank off- 
diagonal blocks. Representing and manipulating these matrices in the hierarchical matrix 
framework [7J requires only O(nlogn) space and O(nlog^n) steps. As a result, the block 
LDL^ factorization takes O(n^log^n) = O(A^log^A^) steps. The resulting block LDL^ 
factorization serves as an excellent preconditioner for the discrete Helmholtz system and 
applying it to any vector takes only O(n^logn) = O(A^logA^) steps using the hierarchical 
matrix framework. By combining this preconditioner with GMRES, we obtain iteration 
numbers that are almost independent oi uj. In a typical example with a computational 
domain of 256 x 256 wavelengths and four million unknowns, only 3 to 4 GMRES iterations 
are required (see Section [s]). 

We also extend this approach to the 3D case and construct an approximate block LDL^ 
factorization by eliminating the unknowns face by face, starting from a face with Sommer- 
feld condition specified. Though each intermediate n? x n? Schur complement matrix still 
corresponds to the restriction of a half-space Green's function to a face, the off-diagonal 
parts may not be of numerically low-rank. However, since the goal is to construct a pre- 
conditioner, we still represent and manipulate these matrices under the hierarchical matrix 
framework. Numerical results show that applying the resulting preconditioner is highly 
efficient and the preconditioned GMRES solver converges in a small number of iterations, 
weakly depending on uj. 

The main observation of the sweeping preconditioner comes from the analytic low-rank 



property of the Green's function of the continuous Helmholtz operator. On the other 
hand, the algorithms construct the approximation to the Green's function of the discrete 
Helmholtz operator. It is important that this Green's function is calculated from the 
discretized problem to be solved numerically and is not an independent approximation of 
the continuous analogue. 

1.2 Related work 

There has been a vast literature on developing efficient algorithms for the Helmholtz equa- 
tion. A wide class of methods for special sets of solutions are based on asymptotic expansion 
of the solution u{x). These techniques of geometric optics type are efficient when cj is very 
large. A review article on these methods can be found in [16]. There is also a class of 
methods based on boundary integral or volumetric integral representations. These inte- 
gral equation methods can be highly efficient for piecewise constant velocity fields when 
combined with fast summation methods such as the fast multipole methods and the fast 
Fourier transforms ^ [9l [T71 [181 EHl S^- Here we will focus on the methods that discretize 
the Helmholtz equation directly. 

The most efficient direct methods for solving the discretized Helmholtz systems are 
the multifrontal methods or their pivoted versions [121 ESI EH] • The multifrontal methods 
exploit the locality of the discrete operator and construct an LDL^ factorization based on 
a hierarchical partitioning of the domain. Their computational costs depend quite strongly 
on the dimensionality. In 2D, for a problem with N — n x n unknowns, a multifrontal 
method takes 0(A^^/^) steps and 0(A^log A^) storage space. The prefactor is usually rather 
small, making the multifrontal methods effectively the default choice for the 2D Helmholtz 
problem. In 3D, for a problem with N — nx nx n unknowns, a multifrontal method takes 
O(n^) = 0(N'^) steps and O(n^) = 0(N^/^) storage space. For large scale 3D problems, 
they can be very costly. 

In the setting of the elliptic operators, the intermediate matrices of the multifrontal 
methods can be well approximated using hierarchical matrix algebra and this allows one to 
bring the cost down to linear complexity in both 2D and 3D [36, 45j. This is, however, not 
true for the Helmholtz operator. As we pointed out, the sweeping preconditioner introduced 
in this paper is also based on constructing an LDL^ factorization of the Helmholtz operator. 
However, due to its specific sweeping (or elimination) order, which is very different from 
the one of the multifrontal methods, we are able to represent the intermediate matrices in 
a more effective way and obtain a highly efficient preconditioner. 

There has been a surge of developments in the category of iterative methods for solving 
the Helmholtz equation. The following discussion is by no means complete and more details 
can be found in [20j. 

Standard multigrid methods do not work well for the Helmholtz equation for several 
reasons. The most important one is that the oscillations on the scale of the wavelength 
cannot be carried on the coarse grids. Several methods have been proposed to remedy 
this [El [la [231 EH El SI]. For example in [8, 34j, Brandt and Livshits proposed the 
wave-ray method. This method uses the standard smoothers to remove the coarse and fine 
components of the residue. It also decomposes the component that oscillates on the scale 
of the wavelength into rays pointing at different directions. Each ray is further represented 
with a phase and amplitude representation, and the amplitude is relaxed with an anisotropic 
grid aligned with the ray direction. A limitation of the wave-ray method is however that 
the method is essentially restricted only to the case of constant velocity field. We would 



like to point out that there is a connection between the wave ray method and the sweeping 
preconditioner proposed in this paper, as both methods exploit the analytic behavior of the 
Green's function of the Helmholtz equation. The wave ray method relies on the Green's 
function over the whole domain, while the sweeping preconditioner uses its restriction on a 
single layer. 

Several other methods [21 [TTl [33] leverage the idea of domain decomposition. These 
methods are typically quite suitable for parallel implementation, as the computation in 
each subdomain can essentially be done independently. However, convergence rates of the 
these methods are usually quite slow [20] . 

Another class of methods [H |2T1 |22l |30] that attracts a lot of attention recently precon- 
ditions the Helmholtz operator with a shifted Laplacian operator, 

A ^— — (a + i/3), a > 0, 

to improve the spectrum property of the discrete Helmholtz system. Since the shifted 
Laplacian operator is elliptic, standard algorithms such as multigrid can be used for its 
inversion. These methods offer quite significant improvements for the convergence rate, 
but the reported number of iterations typically still grow linearly with respect to uj and are 
much larger than the iteration numbers produced by the sweeping preconditioner. 

Several other constructions of preconditioners [3", 'M', ^38] are based on incomplete LU 
(ILU) decomposition, i.e., generating only a small portion of the entries of the LU factoriza- 
tion of the discrete Helmholtz operator and applying this ILU decomposition as a precondi- 
tioner. Recent approaches based on ILUT (incomplete LU factorization with thresholding) 
and ARMS (algebraic recursive multilevel solver) have been reported in [38j. These ILU 
preconditioners bring down the number of iterations quite significantly, however the number 
of iterations still scale typically linearly in cj. In connection with the ILU preconditioners, 
the sweeping preconditioner can be viewed as an approximate LU (ALU) preconditioner: 
instead of keeping only a few selected entries, it approximates the whole inverse operator 
more accurately in a more sophisticated and effective form, thus resulting in substantially 
better convergence properties. 

1.3 Contents 

The rest of this paper is organized as follows. Section[2|presents the sweeping preconditioner 
in the 2D case and Section [3] reports the 2D numerical results. We extend this approach to 
the 3D case in Section [4] and report the 3D numerical results in Section [5j Finally, Section 
[6] discusses some future directions of this work. 

2 Preconditioner in 2D 

2.1 Discretization 

Recall that the computational domain is D — (0, 1)^. Let us assume for simplicity that 
the Sommerfeld condition is specified over the whole boundary. One standard way of 
incorporating the Sommerfeld boundary condition into ([l]) is to use the perfectly matched 



layer (PML) |ll[IOl[29]. Introduce 
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Here 77 is typically about one wavelength and C is an appropriate positive constant inde- 
pendent of (jj. The PML approach replaces di with si{xi)di and 82 with S2(x2)d2, which 
effectively provides a damping layer of width 77 near the boundary of the domain [0, 1]^. 
The resulting equation is 



(sidi)(sidi) + (S2d2)(s2d2) + 
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X edD. 



Without loss of generality, we assume that f{x) is supported inside [77, 1 — 77]^ (away from 
the PML). Dividing the above equation by 5i(xi) 52(^2) results 
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The advantage of working with this equation is that it is symmetric, which offers some 
convenience from the algorithmic point of view. We discretize the domain with a Cartesian 
grid with spacing h — 1/(72 +1). In order to discretize each wavelength with a couple of 
points, the number of points n in each dimension needs to be proportional to uj. We assume 
that n to be an integer power of two for simplicity. The interior points of this grid are 

^ = {Phj = {^^^Jh) : 1 < i, j < 77.} 
(see Figure [1] (left)) and the total number of points N is equal to 
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Figure 1: Left: Discretization grid in 2D. Right: Sweeping order in 2D. The dotted grid 
indicates the unknowns that have already been eliminated. 

We denote by Uij, fij, and Cij the values oiu{x), /(x), and c{x) at point pij = (ih^jh). 
The standard 5-point stencil finite difference method writes down the equation at points in 



V using central difference. The resulting equation at pi^j — (ih^jh) is 
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(•••) ^ij = kj (3) 



with Uifj' equal to zero for {i' ^f) that violates 1 < i',/ < n. Here (• • • ) stands for the sum 
of the four coefficients appeared in the ffist line. We order uij row by row starting from 
the first row j — 1 and denote the vector containing all unknowns by 

U = (^/l,l, '^2,1, . . • , Un^i, . . . , Ui^n, ^2,n, • • • , ^n,n) • 

Similarly, /^j are ordered in the same way and the vector / is 

/ ^ v/l,!' /2,l5 • • • 5 /n,l5 • • • 5 /l,n5 /2,?25 • • • 5 Jn^n) • 

Then (pi) takes the form At/, = /. We further define Vm to be the unknowns in the m-th 

' m ^ \Pl,mi • • • iPn,mj 



row 



and introduce 
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Then 



«=K,4,...,<)* and f = iflfl...,fj. 
Using these notations, the system Au = / takes the following tridiagonal block form 
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where A^^^ are tridiagonal matrices and Am^m-i — ^m-i m ^^^ diagonal matrices. 

We introduce the notion of the sweeping factorization^ which is essentially a block LDL^ 
factorization of A that eliminates the unknowns layer by layer. Starting from the first row 
of unknowns Vi gives 
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where Si = ^1,1, 5^2 = ^2,2 — ^2,1*5]^ ^1,2, and the matrix Li is a block lower-triangular 
matrix given by 



Li{V2,Vi) = A2^iSi ^, Li{Vi,Vi) = / (1 < i < n), and zero otherwise. 



Repeating this process over all Vm for tti = 2, . . . , n — 1 gives 
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where Sm = Am^m - ^m,m-iS^\Al^_^^^ for m = 2, 3, . . . , n. The matrix Lm is given by 
LmiVm^i^Vm) = Am+i,m5'^\ Lm{Vi,Vi) = / (1 < z < n), and zero otherwise. 

This process is illustrated graphically in Figure [l] (right). Inverting this factorization Q 
for A gives the following formula for u: 
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Algorithmically, the construction of the sweeping factorization of A can be summarized 
as follows by introducing T^ = S^^ . 

Algorithm 2.1. Construction of the sweeping factorization of H. 



Si = Ai,i andTi = 5f\ 



for 771 = 2, . . . , n do 

end for 

Since Sm and T^ are in general dense matrices of size n x n^ the cost of the construction 
algorithm is of order O(n^) = 0{N'^). The computation oi u = A~^f is carried out in the 
following algorithm once the sweeping factorization is ready. 

Algorithm 2.2. Computation of u — A~^f using the sweeping factorization of A. 

1: for 771 = 1, ... ,n do 

3: end for 

4: for 771 = 1, . . . , n — 1 do 

6: end for 

7: for 771 = 1, ... ,77 do 

9: end for 

10: for 771 = 77 — 1, . . . , 1 do 

11: Ujji ^ 'Uttt, 1 jjiyJ\jjijjij^\UjjiJ^\j 

12: end for 

Obviously the computations of TmUm in the second and the third loops only need to be 
carried out once. However, we prefer to write the algorithm this way for simplicity. The 
cost of computing u is of order 0(77^) = 0(A^^/^). This is O^N^/'^) times more expensive 
compared to the multifrontal methods, therefore Algorithms |2. 1| and |2.2| themselves are not 
very useful. 



2.2 Main observation 



Let us consider the meaning of the matrix T^ 
blocks of the factorization Q. 



Consider only the top-left m x m 
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where the L^ matrices are redefined to their restrictions to the top-left m x m blocks. The 
matrix on the left is in fact the discrete Helmholtz operator of the half space problem below 
X2 = {m -\- l)h and with zero boundary condition on X2 = (m + l)/i. Inverting the above 
factorization gives 
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The matrix on the left is the discrete half-space Green's function of the Helmholtz operator 
with zero boundary condition. On the right side, due to the definition of the matrices 
Li, . . . , Ljn-i, the (m, m)-th block of the whole product is exactly equal to 5"^. Therefore, 

Tm = S'"^ is the discrete half-space Green function of the Helmholtz operator 
with zero boundary at X2 = (m + l)/i, restricted to the points on X2 = mh. 

The main observation of our approach is that 

Tjn and Sm are highly compressible with numerically low-rank off-diagonal blocks. 

The following theorem shows that this is true for the continuous half-space Green's function 
for the case of constant velocity field c(x) = 1. 

Theorem 2.3. Let 

Y = [pi,m = {ih,mh),i = 1,...,-| and X = jp^,^ = {ih,mh),i = - + l,...,nj , 

and G he the (continuous) half-space Greenes function of the Helmholtz operator for the do- 
main (— oc,oc) X (—00,(771 + l)h) with zero boundary condition. Then (G(x^y))xex,yeY 
is numerically low-rank. More precisely, for any e > 0, there exist a constant R = 
0(logcj| logsp); functions {ar{x)}i<r<R for x G X and functions {l3r{y)}i<r<R for y ^Y 
such that 
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< e for X ^ X,y ^Y. 



The proof of this theorem relies on the following theorem from [37j. Let Hq{-) be the 
0-th order Hankel function of the first kind. 



Theorem 2.4. Let uj be the angular frequency and A = 27r/uj. Let W > 0. There exists 
C{W) such that, for L > 0, e > 0, and S > C{W)\\oge\ • ^, there exist a constant 
J < log(cL;L)| log£p; functions {0j(^)}i<j<j; and functions {Xj{y)}i<j<J such that 
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Figure 2: Left: The setting of Theorem 2.4, Right: The setting of Theorem 2.3 



The setting of this theorem is ihustrated in Figure ^^ (left). Using Theorem 2.4, the 



proof of Theorem 2.3 goes as fohows 



Proof of Theorem \2.3[ Let W = 2h. We partition the set X into the union of the near set 
Xn and the far set Xp depending on the distance from Y. 

Xn^!^P^ {pi,P2) eX,pi<^ + lc{W)\ \og{s/2)\X 

Xf = L= {pi,p2) e X,pi > ^ + ^C{W)\log{s/2)\X } . 

Similarly, Y is partitioned into the union of Y^ and Yjr 

YN = ip= {pi,P2) eY,pi>^- ^C{W)\logie/2)\X 

Yf = L= ipi,P2) e Y,pi < ^ - ^CiW)\\og{e/2)\x\ . 

See Figure^ (right). These partitionings introduce a natural block structure for the matrix 

{G{x,y))j,^x,yeY- 

\G{x, y))xeXN,yeYN {G{x, y))x^XN,y^YF\ /^x 

,(G(x, y))x^XF^y^YN {G{x, y))xeXF,yeYF J 

Let p = X/h be the number of points per wavelength. It is clear from the definition of 
Xn and Y/v that each of them has at most ^C{W)\loge\X/h — ^C{2h)\loge\p points. 
Hence the ranks of the (1, 1), (1, 2), and (2, 1) blocks of Q are all bounded from above by 
lC{2h)\\ogs\p. 



Let us consider the (2, 2) block. Define M{Yf) to be the mirror image set of the set Yp 
with respect to the hne X2 = (m + l)h. Due to the zero Dirichlet boundary condition at 
X2 = (m + l)/i, for X G Xp and y ^Yp 

G{x, y) = H^{uj\x - y\) - Ho{uj\x - M{y)\) 

where M{y) G AHYf) is the mirror image of y. Yp [JM{Yf) is contained in the box 
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[m/i, (m + 2)/i]. 



Since the distance between these two boxes is C{W)\log{£/2)\X and their widths are 
bounded by 1, Theorem 2.4 guarantees that there exist a constant J < log(cj)| log(6/2)p, 
functions {(l>j(x)}i<j<j, and functions {Xj{y)}i<j<J such that 



Ho{cj\x -y\)-Y^ (j)j{x)xj{y) 



for X E Xf and y E Yf[JM{Yf). This imphes that 
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Combining this with the estimates for the other three blocks shows that there exists 
R = |C(2/i)| log(5/2)|p+log(cj)| log(£/2)p = 0(log(cj)| log^p) and functions {ar{x)}i<r<R 
for X E X and functions {^r{y)}i<r<R for y E y such that 
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< 6 for X ^ X^y eY. 
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For a fixed 5, Theorem |2.3| shows that the rank R grows logarithmically with respect to 
(jj (and thus to n). Though the theorem states the result under the case that X contains 
the points on the left half and Y contains the points on the right half, it also applies to any 
disjoint intervals X and y on X2 = rah due to the translational invariance of the kernel 
G{x,y) in the xi direction. It is also clear that, when X and Y are well-separated from 
each other, the actual rank R should be smaller. 

Theorem |2.3| can be extended to the case of smooth layered media where the velocity 
variation only depends on xi. In this case, the restriction of the Green's function to X2 = mh 
does not develop caustics. Therefore, the geometric optics representation A{x,y)e'^^^^^'y^ 
of the Green's function for x E X and y ^ Y can be made sufficiently accurate as long 
as X and Y are well-separated. The amplitude A(x^ y) is numerically low-rank due to 
its smoothness. The phase term is also numerically low-rank since for the layered media 
$(x,?/) = t(x) — T{y) where t(-) is the travel time function from a fixed point. Therefore, 



10 



their product, the Green's function G{x^y)^ is also numerically low rank for well-separated 
X and Y. 

Numerical experiments confirm the result of Theorem 2.3 For the constant coefficient 
case c{x) — 1 with ^ = 32 (n = 256), Figure Isl (left) shows the numerical ranks of the 
off-diagonal blocks of T^ for m = 128. For each off-diagonal block, the singular values of 
this block are calculated and the value in each block indicates the number of singular values 
that are greater than 10~^. For non-constant velocity fields c(x), the rank estimate would 
depend on the variations in c{x) and numerical results suggest that the off-diagonal blocks 
of Tjn and Sm still admit this low-rankness property for a wide class of c{x). An example 
for the non-constant velocity field is given in Figure [s] (middle). 

We would like to emphasize that both the Sommerfeld boundary condition and the 
layer- by-layer sweeping order are essential. To illustrate that, we perform the same test 
with the same threshold 10~^ but with zero Dirichlet boundary condition. The result of 
Tjri for m = 128 is plotted in Figure [s] (right). It is clear that the rank of a off-diagonal 
block is much higher and grows almost linearly with respect to the size of the block. This 
clearly shows the importance of the Sommerfeld boundary condition. A similar matrix T^ 
would also appear if one adopts different elimination orders such as the one of multifrontal 
methods or the one proposed in [36j. Therefore, these elimination orders do not result 
efficient solution methods for the Helmholtz equation. 
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Figure 3: Numerical ranks of off-diagonal blocks of T^. Left: Constant coefficient case 
with PML boundary condition. Middle: Non-constant coefficient case with PML boundary 
condition. Right: Constant coefficient case with zero Dirichlet boundary condition. 



2.3 Hierarchical matrix representation 

Since T^ and Sm are highly compressible with numerically low-rank off-diagonal blocks, it is 
natural to represent these matrices using the hierarchical matrix (or 7{-matrix) framework 
proposed by Hackbusch et al [31261 [27], where off-diagonal blocks are represented in low- 
rank factorized form. The discussion below is by no means original and is included for the 
sake of completeness. 

At the 7Ti-th layer for any fixed tti, we construct a hierarchical decomposition of the grid 
points in Vm through bisection. At level (the top level), the set 

At level £, there are 2^ sets J^^ for i = 1, . . . , 2^ given by 

J! = {Pt,m : (i - 1) • n/2^ + l<t<i- n/2^}. 
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The bisection is stopped when each set J^ contains only a small number of indices. Hence, 
the number of total levels L is equal to log2 n — 0(1) (see Figure [4] (left)). We often write 
G{Ji^ Ji>) (the restriction of a matrix G to J^ and J^,) as G\^,. 



J? 



Ji 



Ji 



Ji 



Ji 



j^ 



I I I I 



Ji 



I I I I I I I I I I I I I I 



Figure 4: Hierarchical matrix representation. Left: Hierarchical partitioning of the index 
set J for each layer. Right: Induced partitioning of the matrix T^ in the weakly admissible 
case. Off-diagonal blocks (in white) are stored in low-rank factorized form. Diagonal blocks 
(in gray) are stored densely. 

The hierarchical matrix representation relies on the notion of well-separatedness between 
different sets. If jf and jf, are well-separated from each other, then G{jf,jf,) is allowed 
to be stored in a low-rank factorized form. There are two different choices of the notion of 
well-separatedness [7J. In the weakly admissible case, J-; and J^, are well-separated if and 
only if they are disjoint. In the strongly admissible case, J^ and J^, are well-separated if and 
only if the distance between them is greater than or equal to their width. Next, define the 
interaction list of J^ to be the set of all index sets J^, such that J-; is well-separated from 
J^, but J/^^'s parent is not well-separated from JT^f 's parent. It is clear from this definition 
that being a member of another set's interaction list is a symmetric relationship. 

2.3.1 Weakly admissible case 

In the weakly admissible case, the interaction list of J21 contains only J2i-i and vice versa. 

Matrix representation. For a fixed 5, let R — O(logcj) = O(logn) be the maximum 
over the ranks of the off-diagonal blocks on all levels. For a given matrix G, the hierarchical 
matrix framework represents all blocks G\^, = G{jf,jf,) with jf and jf, in each other's 
interaction list in the factorized form with rank less than or equal to R. For example, at 
the first level, the two off-diagonal blocks Gj 2 = G{Ji,J2) and G21 ^ G{J2^Ji) are 
represented with 

Gl,^Ul,{Vl,y and Gl, ^ Ul.iViJ , 

where each of 11^2^ ^21' ^1^2' ^2^1 ^^^ ^^ most R columns. At the second level, the new off- 
diagonal blocks are G121 ^21, G34, and G43, each represented in a similar way. Finally, 
at level L — 1, all diagonal blocks G-^^ for i = 1,...,2^~^ are stored densely. This 
representation is illustrated in Figure [4] (right). The total storage cost is 0{Rnlogn). 
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Matrix-vector multiplication. Let us consider the product Gf where / is a vector of 
size n. Denote by // the part of / restricted to if. Using the block matrix form, the 
product is 

^G^i Gi2\ ffi\ ^ (Gi^ifi +G12/2' 

v^2,l ^2,2/ v/2/ vG'2,1/1 +^^2,2/2^ 

First, the product G\ 2/2 is computed with G\ 2/2 ^ ^12 ((^i''^2)V2)- The same is carried 
out for the product G\ ifl- Second, the computation of Gj ifl and G^ 2/2 is done recur- 
sively since both G\ i and G2 2 are in the hierarchical matrix form. We denote this matrix- 
vector multiplication procedure by hnnatvec(G, /) and its computational cost is 0(Rn\ogn). 



Matrix addition and subtraction. Consider the sum of two matrices G and H with 



their off-diagonal blocks represented in the factorized form by Gf ■ ^ U^AVf-Y and Hf 



Xf AY/-Y. Under the block matrix notation, the sum is 

G'l,! G\^ ^ (ii\\ ^\;2\ ^ (G\\ + R\^^ Gj 2 + ^1,2 

^^2,1 ^^2^2/ V^2,l ^2,2/ V^2,l + ^2,1 ^^2^2 + ^2,2 



First, Gl^2 + ^i,2 ^ ^i,i(V^i'2)' + ^i,2(n'2)' = (^1,2,^1,2) (^i'2,n'2)'- One uccds to rccom- 
press the last two matrices in order to prevent the rank of the low rank factorization from 
increasing indefinitely. This can be done by computing QR decomposition of (^125^12) 
and (V^i^2 5 ^1^2)5 followed by a truncated SVD of a matrix of small size. The same procedure 
is carried out for G\ \^ii\ \ to compute the necessary factorization. Second, let us consider 
the diagonal blocks. Gj ^^ + ^\\ ^^^d G2 2 + ^22 ^^^ done recursively since they are two 
sums of the same nature but only half the size. This addition procedure is denoted by 
hadd(G, i7). The subtraction procedure is almost the same and is denoted by hsub(G,i7). 
Both of them take 0(i?^n log n) steps. 



Matrix multiplication. Let us consider the sum of two matrices G and H with their 



off-diagonal blocks represented by G\- ^ U^AVf-Y and Hf ■ ^ XfAYf-Y- Under the block 



matrix form, the product is 

G'l,! Gi2\ 1^^1,1 ^i,2A ^ f^iiH^^ + G^ 2^2,1 GiiH^2 + G'l 2^2,2 

^^2,1 ^^2^2/ V^2,l ^2,2/ vG'2, 1^1,1 + ^^2,2^2,1 G2^iH^2 + ^^2^2^2,2 

First, the off-diagonal block G\^Hl2 + Gj 2^2,2 ^ G'j 1X12(^1^2)* + ^i,2(l^i^2)^^2,2- The 
computation G\iX\2 and (V^i''^2)^^2 2 ^^^ essentially matrix- vector multiplications. Once 
they are done, the remaining computation is then similar to the off-diagonal part of the ma- 
trix addition algorithm. The other off-diagonal block G2 1^1 1 + G2 2^2 1 ^^ done in the same 
way. Next, consider the diagonal blocks. Take G\ yH\ i^G\ 2H2 i as an example. The first 
part G\ lii^i 1 is done using recursion. The second part is G\ 2^2 1 ^ ^1 2 (^i''^2)^^2 1(^2^1)^5 
where the middle product is carried out first in order to minimize the computational cost. 
The final sum G\ iHl-^ + G\ 2^2 1 ^^ done using the matrix addition algorithm described 
above. The same procedure can be carried out for G2 1^1 2 + G^ 2^2 2- This matrix multi- 
plication procedure is denoted by hnnul(G,i7) and its computational cost is 0(i?^n log^ n). 
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Matrix inversion. The inverse of G is done by performing a 2 x 2 block matrix inversion: 

where S — G\2 ~ G\i{G\i)~^G\2- The computation of this formula requires matrix 
additions and multiplications, along with the inversion of two matrices S and G\ ^^ half of 
the original size. The matrix additions and multiplications are carried out by the above 
procedures, while the inversion are done recursively. This matrix inversion procedure is 
denoted by hinv(G) and its cost is 0{R^n\o^ n). 

Multiplication with a diagonal matrix. Finally, we consider the multiplication of G 
with a diagonal matrix D. Denote the two diagonal blocks of D on the first level by D\ ^ 
and D221 both of which are diagonal matrices. In the block matrix form, the product 
becomes 

^G^i Gi2\ /^^i,i A ^ (Gi^iD^^ G'l 2^2,2^ 

V^2,l ^2,2/ \ ^2,2/ V^2, 1^1,1 G'2^2^2,2y 

Consider the off-diagonal blocks first. For example, ^12^22 ^ ^1 2(^i''^2)^^2 2 ^^^ ^^'^^ ^^ 
done by scaling each columns of (^1^2)^ by the corresponding diagonal entries of i?2 2- "^^^ 
same is true for G\ iD\ ^. For the diagonal blocks, say G\ yD\ ^^ we simply apply recursion 
since G\ ^ is itself a hierarchical matrix and D\ ^ is diagonal. This special multiplication 
procedure is denoted by hdiagnnul(G, i?) if D is on the right or hdiagnnul(L), G) if D is on 
the left. The cost of both procedures is 0{Rn\ogn). 

2.3.2 Strongly admissible case 

The matrix representation and operations in the strongly admissible case are similar to the 
ones in the weakly admissible case. The only one that requires significant modification is 
the matrix multiplication procedure R — hnnul(G,i7), where the most common step is the 
calculation of 

^i,i" ^ G^^ifH-,-,,. (8) 

In order to simplify the discussion, we denote a matrix symbolically by H if it is in hier- 
archical form and by F if it is represented in a factorized form. The product (pi) can then 
take one of the following eight forms 



H = HH, 


H = H • F, 


H = FH, 


H = F • F 


F = HH, 


F = HF, 


F = FH, 


F = F • F. 



All of them except one have already appeared in the matrix multiplication procedure of 
the weakly admissible case and the only one that is new is F = H • H. We implement this 
using the randomized S VD algorithm proposed recently in [28l 132] for numerically low-rank 
matrices. The main idea of this randomized algorithm is to capture the column (or row) 
space of the matrix by multiplying the matrix with a small number of Gaussian random 
test vectors. Results from random matrix theory guarantee that the column space of the 
product matrix approximates accurately the span of all dominant singular vectors of the 
original (numerically low-rank) matrix. Since the product matrix has much fewer columns, 
applying singular value decompositions to it gives rise to an accurate and efficient way to 
approximate the SVD of the original matrix. Notice that this randomized approach only 
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requires a routine to apply the original matrix to an arbitrary vector and everything else is 
just standard numerical linear algebra. In our setting, applying H • H to a vector is simply 
equal to two hmatvec operations. 

2.4 Approximate inversion and preconditioner 

Let us denote the approximations of Sm and T^ in the hierarchical matrix representation 
by Sm and T^, respectively. The construction of the approximate LDL^ factorization of H 
takes the following steps. 

Algorithm 2.5. Construction of the approximate sweeping factorization of H in the hier- 
archical matrix framework. 



Si = Ai^i and Ti = hinv(S'i). 
for 771 = 2, . . . , n do 

Sm = hsub(A^,^,hdiagnnul(A^,^_i,hdiagnnul(T^_i,A^_i,^)) and Tm = hinv(S',n). 
end for 
The cost of Algorithm [2^ is 0{R'^n'^log^ n) = 0{R'^Nlog^ N). The computation of u 



A ^ f using the this approximate factorization is summarized as follows. 

Algorithm 2.6. Computation of u ^ A~^f using the approximate sweeping factorization 
of A in the hierarchical matrix framework. 
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for 771 = 1, ... ,n do 



'^771 — Jm 

end for 

for 771 = 1, . . . , 77 — 1 do 

Um+i = Um+1 - ^m+i,m ' hnnatvec(r^, Um) 
end for 
for 771 = 1, ... ,77 do 

Um = hnnatvec(T^,ii^) 
end for 
for 771 = 77 — 1, . . . , 1 do^ 

Um = Um- hm3tyec{Tm, Am,m+lUm+l) 

end for 

The cost of Algorithm 2.6 is 0(i?77^ log77) = 0{RN log N). Algorithm 2.6 defines an oper- 



ator 

which is an approximate inverse of the discrete Helmholtz operator A. When the threshold 
£ is set to be sufficiently small, M can be used directly as the inverse of H and u can be 
taken as the solution. However, a small £ value means that the rank R of the low-rank 
factorized form needs to be fairly large, thus resulting large storage and computation cost. 



On the other hand, when R is kept rather small. Algorithms 2.5 and 2.6 become highly 
efficiently both in terms of storage and time. Though the resulting M is not accurate 
enough as the inverse of A, it serves as an excellent preconditioner. Therefore, we solve the 
preconditioner system 

MAu = Mf 

using iterative solvers such as GMRES and TFQMR [HI [32]. Since the cost of applying 
M to any vector is 0{RN log N)^ the total cost of the iterative solver is 0{NjRN log N)^ 
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where Nj is the number of iterations. The numerical results in Section |3] demonstrate that 
Nj is in practice very small, thus resulting an algorithm for almost linear complexity. 



Theorem 2.3 shows that in the constant coefficient case the hierarchical matrix represen- 
tation of Tjn is accurate. Therefore, the preconditioner M well approximates the inverse of 
A and the number of iterations Nj is expected to be small. The numerical results in Section 
[3] demonstrates that Nj is also small for general velocity field such as converging lens, wave 
guides, and random media. Here we provide a heuristic explanation for this phenomena. 
For the variable coefficient case, the numerical rank of the off-diagonal blocks of T^ can 
potentially increase mainly due to the turning rays, i.e., the rays that leave the 7n-th layer 
downward, travel horizontally in xi direction, and come upward back to the 777.-th layer. 
The interactions related to turning rays are difficult to capture in the hierarchical matrix 
representation of T^ if R is small. However, the iterative solver addresses this interaction 
in several steps as follows: the downward part of the ray is processed by a first few sweeps, 
the horizontal part is then captured by the T^ matrix of the next sweep, and finally the 
upward part of the ray is processed by a couple of extra sweeps. 

In the presentation of the sweeping preconditioner, we choose the sweeping direction 
to be in the positive direction of the X2 axis. It is clear that sweeping along either one of 
the other three directions also gives a slightly different sweeping preconditioner. Due to 
the variations in the velocity field and, more precisely, the existence of the turning rays, a 
carefully selected sweeping direction can often result significantly fewer number of GMRES 
iterations than the other directions do. We will give one numerical example to demonstrate 
this phenomenon in Section [3j 



2.5 Other boundary conditions 

So far, we discuss the case with Sommerfeld boundary condition specified over the whole 
boundary. From the above discussion, it is clear that the success of the preconditioner only 
relies on the fact that Sm and T^ are compressible. For many other boundary conditions, 
the matrices Sm and T^ also have this property, as long as the Helmholtz problem is not 
close to resonance. Here, we mention three representative examples. 



X2a Dirichlet 



u(xi, 1) = b{xi) 




PML 



1 Xi 




PML 




PML 



Figure 5: Mixed boundary conditions. Left: Depth extrapolation problem in seismology. 
Middle and right: Problems with partly zero Dirichlet boundary condition and non-zero 

In the first example (see Figure [5] (left)), the PML boundary condition at X2 = 1 is 
replaced with non-trivial Dirichlet boundary condition u{xi,l) = b{xi) and / is equal to 
zero. This corresponds to the depth extrapolation problem [3 135] in refiection seismology. 
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The proposed algorithm proceeds exactly the same and the only modification is that the 
boundary condition b{xi) is transformed into an appropriate forcing term at last layer of 
unknowns (i.e., the index set Vn)- 

In the second example, the zero boundary condition is mixed with the PML condition. 
In Figure [s] (middle)) the zero Dirichlet boundary condition is specified on xi = and 
xi = 1. The matrix T^ then corresponds the restriction (to an edge) of the Green's 
function of the discrete Helmholtz operator in a half strip. By using the imaging method 
also in the xi direction, one can show that the rank of the off-diagonal blocks is bounded 
by 0(logCL;| log^p) with a slightly larger constant due to the mirror images. In Figure pi 
(right)), the zero Dirichlet boundary condition is specified on xi = 1 and X2 = 1. T^ 
corresponds to the restriction of the Green's function of the discrete Helmholtz operator in 
a quadrant in this case. 

Finally, the PML boundary condition is by no means the only approximation to the 
Sommerfeld condition. As the essential requirement is that the problem should not be 
close to resonance (i.e., a wave packet escapes the domain without spending too much time 
inside), the sweeping preconditioner should work with any reasonable approximations to 
the Sommerfeld boundary condition such as absorbing boundary conditions (ABCs) [T^ ITS] 
and damping/sponge layers. We focus on the PML due to its simplicity, its low non-physical 
reflections, and the symmetry of its discrete system. 

3 Numerical Results in 2D 

In this section, we present several numerical results to illustrate the properties of the sweep- 
ing preconditioner described in Section [2j The implementation is done in C++ and the 
results in this section are obtained on a computer with a 2.6GHz CPU. The GMRES method 
is used as the iterative solver with relative residue tolerance set to be 10~^. 

3.1 PML 

The examples in this section have the PML boundary condition specified at all sides. 

Dependence on uj. First, we study how the sweeping preconditioner behaves when uj 
varies. Consider three velocity fields in the domain (0, 1)^: 

1. The first velocity field is a converging lens with a Gaussian profile at the center of 
the domain (see Figure [6]^a)). 

2. The second velocity field is a vertical waveguide with Gaussian cross section (see 
Figure [6]^b)). 

3. The third velocity field has a random velocity field (see Figure [6]^c)). 

For each velocity field, we test with two external forces f{x). 

1. The first external force f{x) is a Gaussian point source located at (xi, X2) = (0.5, 0.125). 
The response of this forcing term generates circular waves propagating at all direc- 
tions. Due to the variations of the velocity field, the circular waves would bend, form 
caustics, and intersect. 
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2. The second external force f{x) is a Gaussian wave packet with a wavelength com- 
parable to the typical wavelength of the Helmholtz equation. This packet centers at 
(xi,X2) = (0.125, 0.125) and points to the (1, 1) direction. The response of this forcing 
term generates a Gaussian beam initially pointing towards the (1,1) direction. Due 
to the variations of the velocity field, this Gaussian beam should bend and scatter. 




(a) 



I" 

- ^1.15 
1.1 



i 



0.85 

0.8 




(b) 



(c) 



Figure 6: Test velocity fields. 



For each velocity field, we perform tests for ^ = 16, 32, . . . , 256. In these tests, we 
discretize with g = 8 points per wavelength. Therefore, the number of points for each 
dimension is n = 8 x ^ = 128, 256, . . . , 2048. The strongly admissible case is used in the 
implementation of the hierarchical matrix representation. Recall that R is the rank of the 
off-diagonal blocks in the hierarchical matrix and we fix it to be a uniform constant 2. In 
all tests, the sweeping direction is bottom-up from X2 = to X2 = 1. 

The results of the first velocity field are summarized in Table IT} Tgetup denotes the 
time used to construct the preconditioner in seconds. For each external force, A^iter is the 
number of iterations of the preconditioned GMRES solver and Tgoive is the overall solution 
time. When n doubles and N quadruples, the setup cost Tgetup increas es b y a factor of 5 

A remarkable 



2.5 



or 6, which is consistent with the 0(A^log^ N) complexity of Algorithm 
feature of the sweeping preconditioner is that the number of iterations is extremely small. 
In fact, in all cases, the preconditioned GMRES solver converges in less than 3 iterations. 
As a result of the constant iteration number, the solution time increase by a factor of 4 or 5 



when N quadruples, which is consistent with the 0(A^log A^) complexity of Algorithm 2.6 



Finally, we would like to point out that our algorithm is extremely efficient: for a problem 
with N — 11? — 2048^ unknowns, the solution time is only about 30 seconds. 

The results of the second and third velocity fields are summarized in Tables [2] and [3j 
respectively. The behaviors of these tests are similar to the one of the first velocity field. 
In all cases, the GMRES solver converges in less than 5 iterations when combined with the 
sweeping preconditioner. 

Dependence on q. Secondly, we study how the sweeping preconditioner behaves when 
the number of discretization points per wavelength q varies. Fix ^ at 32 and let q be 
8, 16, ... , 64. In the following tests, R is again equal to 2. The sweeping direction is bottom- 
up from X2 = to X2 = 1. The test results for the three velocity fields are summarized in 
Tables [4} [5| and [6| respectively. These results show that the number of iterations remain 
to be extremely small and the overall solution time scales roughly linearly with respect to 
the number of unknowns. 
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Test 1 


Test 2 


a;/(27r) 


Q 


N = n^ 


R 


-^ setup 


iViter 


-^ solve 


iViter 


-^ solve 
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2 
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2 


5.00e-02 


2 


5.00e-02 


32 


8 
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2 


5.05e+00 


2 


2.50e-01 


2 


2.50e-01 


64 


8 
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2 


3.44e+01 


3 


1.45e+00 


3 


1.42e+00 
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8 


10242 


2 


2.16e+02 


3 


7.37e+00 


3 


7.36e+00 


256 


8 


20482 


2 


1.24e+03 


3 


3.31e+01 


3 


3.28e+01 



Table 1: Results of velocity field 1 for different cj. Top: Solutions for two external forces 
with (jj/{27t) = 64. Bottom: Results for different u. 
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Test 2 
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4 
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Table 2: Results of velocity field 2 for different u. Top: Solutions for two external forces 
with uj/{2ti) — 64. Bottom: Results for different uj. 
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Table 3: Results of velocity field 3 for different cj. Top: Solutions for two external forces 
with (jj/{27t) = 64. Bottom: Results for different uj. 
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Test 2 
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Table 4: Results of velocity field 1 for different q. 





Test 1 
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c^/(27r) 
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-^^iter -^ solve 


-^^iter -^ solve 
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32 
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2 5.37e+00 


32 


64 


20482 


2 1.23e+03 


2 2.50e+01 


2 2.49e+01 



Table 5: Results of velocity field 2 for different q. 

Dependence on sweeping direction. Next, we study how the sweeping directions affect 
the convergence rate of the GMRES algorithm. In this example, the velocity field is given 
by c(xi,X2) = 1/2 + X2. The external force is a a narrow Gaussian point source centered 
at (xi,X2) = (0.125,0.5). Two sweeping directions are tested here: the first one sweeps in 
the positive X2 direction while the second sweeps in the negative X2 direction. For the first 
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Test 1 


Test 2 


w/(27r) 


q 


N = n'^ 


J^ J- setup 


-^^'iter -^ solve 


-^^iter -^ solve 
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8 
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2 5.13e+00 


2 2.40e-01 


3 3.10e-01 


32 
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5122 


2 3.47e+01 
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32 


32 
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2 5.87e+00 


2 5.84e+00 


32 


64 
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2 2.52e+01 


2 2.51e+01 



Table 6: Results of velocity field 3 for different q. 

sweeping direction, the matrix T^ approximates the Green's function of the lower half-space 
(—00,00) X (—00,771/1). Since the velocity field decreases in the negative X2 direction, in a 
geometric optics argument the rays emanating from X2 — mh plane do not travel back to 
the same plane. Therefore, we expect that the numerical rank of the off-diagonal blocks of 
Tjn to be low, the preconditioner to be quite accurate, and the number of iterations to be 
small. The geometric theory of diffraction indicates that the coupling between points on 
the plane X2 = mh is via exponentially decaying creeping rays and thus very weak. 

For the second sweeping direction, the matrix T^ approximates the Green's function 
of the upper half-space (—00,00) x (1 — 777/1,00). Since the velocity field increases in the 
positive X2 direction, the rays emanating from X2 = mh can shoot back to the same plane. 
As a result, the hierarchical matrix representation of T^ would incur larger error for the 
same R value and the number of iterations would become larger. 

Table [7| reports the results of these two sweeping directions for different u values. As 
expected by the above argument, the number of iterations for the first sweeping precondi- 
tioner (in the positive X2 direction) remains very small while the number of iterations for 
the second one (in the negative X2 direction) increases slightly with N. 

3.2 Other boundary conditions 

Here we report three examples with different boundary conditions. 

Depth extrapolation. In the first example (see Figure [5] (left)), the velocity field is a 
vertical wave guide. We specify the Dirichlet boundary condition u{xi, 1) = b{xi) at the 
top edge X2 = 1 and the PML at the other three edges. This is the depth extrapolation 
problem in reflection seismology and we report the results of two test cases: 

1. b(xi) = 1. This corresponds to a plane wave entering the wave guide. The center 
part of the plane wave should start to bend and eventually form multiple caustics. 

2. b(xi) = exp (z^xi). This corresponds to a slant wave entering the wave guide. 

The sweeping direction is bottom- up from X2 = to X2 = 1. The results are summarized 
in Table [8j The running time again follows closely the analytical estimate and the number 
of GMRES iterations are bounded by 4. 

Mixed PML-Dirichlet boundary condition. In the second example, the velocity field 
c{x) is equal to constant one and perform two tests with mixed boundary conditions. 

1. In the first test (see Figure [5] (middle)), we specify the zero Dirichlet boundary condi- 
tion at xi = and xi = and the PML condition at the other two sides. The external 
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Table 7: Results of the positive and negative X2 sweeping directions. Top row: the velocity 
field (left) and the solution for the external force (right). Bottom row: results for different 



UJ. 



force f{x) is a Gaussian wave packet with a wavelength comparable to the typical 
wavelength of the Helmholtz equation. This packet centers at (xi,X2) = (0.5,0.125) 
and points to the (cos(7r/8), sin(7r/8)) direction. The Gaussian beam generated by 
this forcing term should bounce back from the edge xi = 1 and then from the edge 
xi = 0. 

2. In the second test (see Figure [s] (right)), we specify the zero Dirichlet boundary 
condition at xi = 1 and X2 = 1 and the PML condition at the other two sides. 
The external force f{x) is a Gaussian wave packet with a wavelength comparable to 
the typical wavelength of the Helmholtz equation. This packet centers at (xi,X2) = 
(0.5,0.125) and points to the (1, 1) direction. The Gaussian beam generated by this 
forcing term should bounce back from the edge xi = 1 and then from the edge X2 = 1. 

The sweeping direction is bottom-up from X2 = to X2 = 1 and the results of these tests 
are summarized in Table |9j The running time again follows the analytical estimate. For 
the first test case with zero Dirichlet boundary condition at xi = and xi = 1, due 
to the reason mentioned in Section |2.5[ the rank of the off-diagonal blocks of the Schur 
complement matrices are slightly higher. Hence, with the same R value the number of 
iterations is expected to increase shghtly. In all cases, the number of GMRES iterations is 
bounded by 10. 



Absorbing boundary condition. In the last example, we replace the PML with the 
second order absorbing boundary condition (ABC). The velocity field c{x) is taken to be 
one and we perform tests with two different external forces, which are similar to the ones 
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Table 8: Results of the depth stepping example for different u. Top: Solutions for two test 
cases with (jj/{27t) = 64. Bottom: Results for different uj. 
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Table 9: Results of the mixed boundary condition example for different uj. Top: Solutions 
for two test cases with (jj/(27v) = 64. Bottom: Results for different cj. 
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given at the beginning of Section 3.1 However, since the low order ABCs generate more 
non-physical reflections at the domain boundaries, we move the support of these external 
forces closer to the center of the computational domain. 

1. The first external force f{x) is a Gaussian point source located at (xi, X2) = (0.5, 0.25). 

2. The second external force f(x) is a Gaussian wave packet with a wavelength com- 
parable to the typical wavelength of the Helmholtz equation. This packet centers at 
(xi,X2) = (0.25,0.25) and points to the (1, 1) direction. 

Due to the same non-physical reflections, the discrete Green's function associated with a 
low order ABC often has off-diagonal blocks with higher numerical ranks compared to the 
discrete Green's function associated with the PML. As a result, we let R increase slightly 
with (jj. The sweeping direction is bottom- up from X2 = to X2 = 1 and the results are 



summarized in Table [10) The setup time grows slightly higher than linear complexity due 
to the increase of R. The number of iteration increases roughly logarithmically with respect 
to uj. In all cases, the number of GMRES iterations is bounded by 13. Overall the results 
for the ABC compares slightly worse than the ones of the PML, suggesting that, in order for 
the sweeping preconditioner to work well, it is essential to minimize non-physical reflections 
at the domain boundary. 
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Table 10: Results of the absorbing boundary condition (ABC) test for different u. Top: 
Solutions for two test cases with cj/(27r) = 64. Bottom: Results for different u. 
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4 Preconditioner in 3D 

4. 1 Discretization 

The computational domain is D = (0, 1)^. Using the same a{t) defined in Q, we define 



5l(xi)= 1+Z^-^ , 52(X2)= 1 + Z-^ ^ 



(jj 



UJ 



53(^3) = 1 + i 



CJ 



The PML replaces di with 5i(xi)5i, 82 with 52(x2)52, and ^3 with S'^{x'^)d'i. This effectively 
provides a damping layer of width 77 near the boundary oi D — (0,1)^. The resulting 
equation is 



{sidi){sidi) + {S2d2){s2d2) + {ssd3){ssds) + 



UJ 



c^{x) 



u = f xeD= [0,l]^ 
u^O X edD. 



Without loss of generality, we assume that f{x) is supported inside [77, 1 — 77]^ (away from 
the PML). Dividing the above equation by 5152^3 results 



<^i o ^ , o f _^ o ^ , o / <53 



S2S3 J \S1S2, J \S1S2 J SiS2S2,C^[x) 



UJ 



u^f. 



The domain [0, 1]^ is discretized with a Cartesian grid with spacing h — l/(n +1). As we 
discretize the equation with a couple number of points per wavelength, the number n of 
samples in each dimension is proportional to uj. The interior points of this grid are 

T^ = te,j,fc = (ihjh.kh) :1 <ij,k <n} 
(see Figure JT] (left)) and the total number of points is equal to A^ = n^. 





Figure 7: Left: Discretization grid in 3D. Right: Sweeping order in 3D. The remaining grid 
shows the unknowns yet to be processed. 

We denote by Uij^ki fi,j,ki and Cij^k the values of u{x), /(x), and c{x) at point Pij^k — 
{ih^jh^ kh). The 7-point stencil finite difference method writes down the equation at points 



25 



in V using central difference. The resulting equation at Pi^j^k = iih,jh, kh) is 



1 / SI 



1 / SI 



1,9 \ / ^i—^J^k ^79. 







UiJ,k-l + ^2 



1 / ss 



SlS2 



UiJ,k+l 



^JJ + -^ 



UJ 



{siS2Ss)iJ,k • cIjj^ 



"(•••) I ^ij,k — JiJ,l 



with Ui'^j'^k' equal to zero for {i'^j'^k') that violates 1 < i'^j'^k' < n. Here (•••) stands 
for the sum of the six coefficients appeared in the first two lines. We order ui^j^k by going 
through the dimensions in order and denote the vector containing all unknowns by 

U = ('^1,1,1, ^^2, 1,1, • • • , '^71,1,1, • • • , Ui^n,ni '^2,n,n, • • • , Un^n,n) • 

Similarly, fij^k ctre ordered in the same way and the vector / is 

/ ^ v/l,l,l5 /2,l,l5 • • • 5 /?2,l,l5 • • • 5 /l, 72, 725 /2, 72, 725 ' ' ' 5 /72,72,72J • 

The whole system takes the form Au = /. We further introduce a block version. Define 
Vm to be the indices in the 777.-th row 



and introduce 



Un 



' m — 1^1,1,7715^2,1,7715 • • • iPn,n,mj 



V'^1,1,7T25 '^2,1,7725 ' ' ' 5 ^^n^n^m) 5 7777 v/l, 1,7775 72,1,7775 ' ' ' 5 Jn,n,m) 



Then 



Using these notations, the system Au — f takes the following block tridiagonal form 

Ml,l ^1,2 
^2,1 ^2,2 



v 



^77—1,77 

^77,77—1 ^n,n 





(ui\ 




//l\ 




U2 


^ 


h 




\Un) 




\fn/ 



where each block Ao .• is of size n^ x n^ and Ar, 



At 



1^, 7 j^^ ^-L oizjc; 10 /\ I L ojlixx jr\.jji jji — \ — -^m 1 777 ^-^^ Qiagonal matriccs. 

Similar to the 2D case, the sweeping factorization eliminates the unknowns face by face, 
starting from the face next to X3 = (illustrated in Figure [t] (right)). The algorithms for 
constructing and applying the sweeping factorization are exactly the same as Algorithms 



2.1 



and 2.2). The matrix T^ = S^^ is now the discrete half-space Green's function with 



zero boundary condition at X3 = (777. + l)/i, restricted to the points on X3 = mh. Recall 
that in the 2D case the off-diagonal blocks of T^ is numerically low-rank. In the 3D case, 
this is no longer exactly true. On the other hand, since we only aim at constructing a 
preconditioner for the Helmholtz problem, it is still reasonable to introduce a hierarchical 
structure on the unknowns on the face X3 = mh and use the hierarchical matrix framework 
to approximate T^ and Sm- 
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4.2 Hierarchical matrix representation 

At the 7Ti-th layer for any fixed m, we build a hierarchical structure for the grid points in 
Vm through bisections in both xi and X2 directions. At the top level (level 0), the set 

At level i, there are 2^ x 2^ index sets J7^^-, z, j = 1, . . . , 2^ 

J!j = {Ps,t,m : (^ - 1) • n/2^ + 1 < 5 < z . n/2\ {j - 1) • n/2^ + 1 < t < j • n/2^}. 

The bisection is stopped when each set J^- contains only a small number of indices. Hence, 
the number of total levels L is equal to log2 n— 0(1). This hierarchical partition is illustrated 
in Figure [8] (left)). 
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Figure 8: Hierarchical matrix representation. Left: hierarchical decomposition of the index 
set J for each layer. Right: Induced partitioning of the matrix T^ in the strongly admissible 
case. Blocks in white are stored in low-rank factorized form. Blocks in gray are stored 
densely. 
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We write Gijf-.J^f,-,) (the restriction of a matrix G to J^f- and sJS-/) as G^ 
strongly admissible case is used here and two index sets J^- and J^,-, on the same level i 
are considered well-separated from each other if max(|z — z'|, |j — /|) > 1. Recall that the 
interaction list of J^-- is defined to be the set of all index sets J^,-, such that J^-- is well- 
separated from J^^,-, but jT^^/s parent is not well-separated from J^f/A/^s parent. When J^f- and 



rj 



I'j' 



J[,-, are well-separated from each other, the numerical rank of their interaction G-- ■,-, is of 
order 0(n/2^). As the number of indices in J^- and J^,-, is equal to (n/2^)^, the numerical 
rank scales like the square root of the number of indices in each set. Therefore, it is still 
favorable to store the interaction G---,-, in a factorized form. In principle, the rank R of 
the factorized form should scale like 0{n/2^). As the construction cost of the approximate 
sweeping factorization scales like 0{R^n^\o^ n) — 0{R^N\o^ N)^ following this scaling 
can be rather costly in practice. Instead, we choose i? to be a rather small constant as the 
goal is only to construct a preconditioner. An illustration of this hierarchical representation 
is given in Figure [s] (right). 
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Once the details of the hierarchical matrix representation are determined, the construc- 
tion of the approximate LDL^ factorization and the application of its inverse take the same 



form as Algorithms 2.5 and |2.6[ respectively. The operator 



defined by Algorithm [2^6] is an approximate inverse and a good preconditioner of the discrete 
Helmholtz operator A. Therefore, we solve the preconditioner system 

MAu = Mf 

using the GMRES algorithm. As the cost of applying M to any vector is 0{Rin?\ogn) — 
0{RN log N), the total cost is 0{NjRn^logn) = 0{NjRN log N), where Nj is the number 
of iterations. The numerical results in Section [5] demonstrate that Nj and R are in practice 
rather small. 



5 Numerical Results in 3D 

In this section, we present several numerical results to illustrate the properties of the sweep- 
ing preconditioner described in Section[4J We use the GMRES method as the iterative solver 
with relative residue tolerance equal to 10~^. The examples in this seciton have the PML 
boundary condition specified at all sides. 

We consider three velocity fields in the domain [0, 1]^: 

1. The first velocity field is a converging lens with a Gaussian profile at the center of 
the domain (see Figure [9]^a)). 

2. The second velocity field is a vertical waveguide with Gaussian cross section (see 
Figure [9]^b)). 

3. The third velocity field is a random velocity field (see Figure [oFc)). 
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Figure 9: Test velocity fields. For each velocity field, the cross sections at xi = 0.5, X2 = 0.5, 
and X3 = 0.5 are shown. 

For each problem, we test with two external forces f{x). 

1. The first external force f(x) is a Gaussian point source located at {xi,X2,xs) = 
(0.5, 0.5, 0.25). The response of this forcing term generates spherical waves propagat- 
ing at all directions. Due to the variations of the velocity field, the circular waves 
should bend and form caustics. 
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2. The second external force f{x) is a Gaussian wave packet whose wavelength is com- 
parable to the typical wavelength of the domain. This packet centers at (xi, X2, xs) = 
(0.5, 0.25, 0.25) and points to the (0, 1, 1) direction. The response of this forcing term 
generates a Gaussian beam initially pointing towards the (0, 1, 1) direction. 

For each velocity field, we perform tests for ^ equal to 5, 10, 20. In these tests, we 
discretize with g = 8 points per wavelength. Hence, the number of points in each dimension 
is n = 40, 80, 160. Recall that R is the rank of the factorized form of the hierarchical matrix 



representation. It is clear from the discussion of Section [42] that the value of R should grow 
with (jj (and n). Here, we choose i? = 2, 3, 4 for cj = 5, 10, 20, respectively. The sweeping 
direction is bottom- up from X3 = to X3 = 1. 

The two plots show 



The results of the first velocity field are reported in Table 11 



the solutions of the two external forces on a plane near xi = 1/2. Tgetup is the time 
used to construct the preconditioner in seconds. A^iter is the number of iterations of the 
preconditioned GMRES solver and Tgoive is the solution time. The analysis in Section 4^ 

0(R^ N \og^ N). When cj grows 



n) 



shows that the setup time scales like 0(i? n'^log 
from 5 to 20, since R increases from 2 to 4, Tgetup increases by a factor of 20 times each 
time (jj doubles. Though the setup cost grows significantly faster than the linear scaling 
0{N), it is still much better than the 0(A^^) scaling of the multifrontal method. A nice 
feature of the sweeping preconditioner is that the number of iterations is extremely small. 
In fact, in all cases, the GMRES solver converges in at most 7 iterations. Finally, we would 
like to point out that our algorithm is quite efficient: for the case with uj/{2ti) — 20 with 
more than four million unknowns, the solution time is only about 3 minutes. 



The results of the second and the third velocity fields are reported in Tables 12 and 



13, respectively. In all cases, the GMRES solver converges in at most 5 iterations when 



combined with the sweeping preconditioner. 
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Table 11: Results of velocity field 1 for different uo. Top: Solutions for two external forces 
with uj/{2ti) = 20 on a plane near xi — 0.5. Bottom: Results for different uj. 
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Table 12: Results of velocity field 2 for different cj. Top: Solutions for two external forces 
with (jj/{27t) = 20 on a plane near xi = 0.5. Bottom: Results for different u. 
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Table 13: Results of velocity field 3 for different cj. Top: Solutions for two external forces 
with (jj/{27t) = 20 on a plane near xi = 0.5. Bottom: Results for different u. 

6 Conclusion and Future Work 

In this paper, we have proposed a sweeping preconditioner for the iterative solution of 
variable coefficient Helmholtz equations in two and three dimensions. The construction of 
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the preconditioner is based on an approximate block LDL^ factorization that ehminates the 
unknowns layer by layer starting from an absorbing layer. By representing and manipulating 
the intermediate Schur complement matrices in the hierarchical matrix framework, we have 
obtained preconditioners with almost linear cost. Numerical examples demonstrate that, 
when combined with standard iterative solvers, these new preconditioners result almost 
cj-independent iteration numbers. 

Some questions remain open. First, in the 2D case, we have proved the compressibility 
result under the constant coefficient case. A natural question is to what extent this is still 
true for a general velocity field. 

The hierarchical matrix representation may not be very accurate for the Schur com- 
plement matrices in 3D, since some high-rank off-diagonal blocks are stored in a low-rank 
factorized form. Yet our algorithm works well with very small iteration numbers. It is 
important to understand why this is the case and also to investigate whether other matrix 
representations would be able to provide more accurate approximations for T^. 

The memory space required by the sweeping preconditioners is linear with respect to the 
number of unknowns. However, the prefactor is higher compared to the shifted Laplacian 
preconditioners and the ILU preconditioners. Most of the memory space is in fact used to 
store the diagonal part of T^, which corresponds to the local part of the half-space Green's 
function. One improvement is to use the asymptotic formula of the Green's function to 
represent the local part analytically and this can eliminate the need of storing the diagonal 
part of the hierarchical matrices. 

The matrix representation used here is often referred as the the TH} form of the hierar- 
chical matrix algebra. More efficient and sophisticated versions are the uniform %^ form 



and the IH? form. For our problem. Algorithm Z5 requires the matrices to be represented 
in the TH} form since it uses the matrix inversion procedure. However, Algorithm 2.6 of 
applying the sweeping preconditioner can potentially speed up dramatically when the TH? 
form is used. 

We have chosen the PML for the numerical implementation of the Sommerfeld condition. 
Many other boundary conditions are available and commonly used. The sweeping approach 
should work for these boundary conditions, as we have brieffy demonstrated for the second 
order ABC. The design and implementation of these other boundary conditions should 
minimize non-physical reffections in order for the sweeping preconditioner to do well. 

The second order central difference scheme is used to discretize the Helmholtz equation 
in this paper. We would like to investigate other more accurate stencils and other types of 
discretizations such as h/p finite elements, spectral elements, and discontinuous Galerkin 
methods. 

Since high frequency fields typically oscillate rapidly on a similar scale throughout the 
computational domain, uniform grids are very common. There are however situation where 
unstructured grids would be natural. The sweeping approach and more general hierarchical 
matrix representations can also be used in this context. The challenge here is to maintain 
compatibility between the matrix representation and the geometry as one sweeps through 
the computational domain. In a second paper [19] another variant of sweeping precon- 
ditioning is presented, which is more fiexible with respect to unstructured and adaptive 
grids. 

The sequential nature of the sweeping approach complicates parallelization of the algo- 
rithm. One possibility is to use parallel hierarchical matrix representation for each layer. 
This would parallelize an inner part of the algorithm. Another technique would leverage the 
idea of domain decomposition and use the sweeping preconditioner within each subdomain. 
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The subdomains should then be coupled with absorbing boundary conditions. 

The Helmholtz equation is only the simplest example of time-harmonic wave equations. 
Other cases include elasticity equation and Maxwell equations. For these more complicated 
systems, multiple wave numbers coexist even for the constant coefficient case. The basic 
idea of the sweeping preconditioner should apply but the details need to be worked out. 
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